[SPARK-46648][SQL] Use zstd as the default ORC compression
#44654
Conversation
Title changed from "Use `zstd` as the default value of `spark.sql.orc.compression.codec`" to "Use `zstd` as the default ORC compression".
Thank you, @HyukjinKwon. The PR description is updated.
The benchmark output still shows `Running benchmark: TPCDS Snappy`; the benchmark's name needs to be updated.
Yes, it's a hard-coded one: `spark/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala`, line 111 in 85b504d. Since it's orthogonal to this configuration change PR, I'll handle it in a new PR.
yaooqinn
left a comment
Thank you @dongjoon-hyun, LGTM.
Thank you, @yaooqinn. Here is the PR to address your comment.
beliefer
left a comment
Late LGTM.
…benchmarks

### What changes were proposed in this pull request?

This PR aims to use the default ORC compression in data source benchmarks.

### Why are the changes needed?

Apache ORC 2.0 and Apache Spark 4.0 will use ZStandard as the default ORC compression codec.
- apache/orc#1733
- #44654

`OrcReadBenchmark` was switched to use ZStandard for comparison.
- #44761

And, this PR aims to change the remaining three data source benchmarks.

```
$ git grep OrcCompressionCodec | grep Benchmark
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BuiltInDataSourceWriteBenchmark.scala:      OrcCompressionCodec.SNAPPY.lowerCaseName())
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala:        OrcCompressionCodec.SNAPPY.lowerCaseName()).orc(dir)
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala:    .setIfMissing("orc.compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44777 from dongjoon-hyun/SPARK-46752.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
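The benchmark change described in the commit message above amounts to dropping the hard-coded Snappy option so the writes pick up the session default instead. A minimal before/after sketch of that pattern (this assumes a live `SparkSession` named `spark` and an illustrative output path, so it is not runnable standalone):

```scala
import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec

// Before: the benchmark pinned the codec explicitly, masking the default.
spark.range(1000).write
  .option("compression", OrcCompressionCodec.SNAPPY.lowerCaseName())
  .orc("/tmp/orc-snappy") // hypothetical path

// After: no explicit option, so the write honors
// spark.sql.orc.compression.codec (zstd by default after SPARK-46648).
spark.range(1000).write
  .orc("/tmp/orc-default") // hypothetical path
```

This way the benchmarks automatically track whatever default codec Spark ships with, rather than silently measuring Snappy.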
What changes were proposed in this pull request?
This PR aims to use `zstd` as the default ORC compression. Note that Apache ORC v2.0 also uses `zstd` as the default compression via ORC-1577. The PR description also linked a presentation about the usage of ZStandard.
Why are the changes needed?
In general, ZStandard produces smaller files. As a result, performance is also generally better when the data lives in cloud storage, since less data has to be transferred.
Does this PR introduce any user-facing change?
Yes, the default ORC compression is changed.
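Users who depend on the previous behavior can still pin the codec explicitly rather than rely on the new default. A sketch of the relevant configuration (the property name is the one this PR changes; `snappy` is shown only as an example value):

```
# spark-defaults.conf: pin the ORC codec instead of relying on the new default
spark.sql.orc.compression.codec    snappy
```

The same override can be set per session, e.g. `SET spark.sql.orc.compression.codec=snappy;` in Spark SQL.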
How was this patch tested?
Pass the CIs.
Was this patch authored or co-authored using generative AI tooling?
No.